Tokenization As The Initial Phase In NLP
Abstract
In this paper, the authors address the significance and complexity of tokenization, the initial phase of NLP. The notions of word and token are discussed and defined from the viewpoints of lexicography and pragmatic implementation, respectively. Automatic segmentation of Chinese words is presented as an illustration of tokenization. Practical approaches to the identification of compound tokens in English, such as idioms, phrasal verbs, and fixed expressions, are developed.
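The abstract's identification of compound tokens (idioms, phrasal verbs, fixed expressions) is commonly implemented as lexicon-based longest-match merging. The sketch below is a minimal, hypothetical illustration of that general technique, not the paper's actual method; the lexicon entries are invented examples.

```python
# Minimal sketch of lexicon-based compound-token identification:
# multi-word units such as phrasal verbs and fixed expressions are
# merged into single tokens. The lexicon is a hypothetical example.

MWE_LEXICON = {
    ("kick", "the", "bucket"),   # idiom
    ("give", "up"),              # phrasal verb
    ("in", "spite", "of"),       # fixed expression
}
MAX_MWE_LEN = max(len(m) for m in MWE_LEXICON)

def tokenize(text):
    """Whitespace tokenization followed by greedy longest-match
    merging of multi-word expressions from the lexicon."""
    words = text.lower().split()
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest possible match starting at position i first.
        for n in range(min(MAX_MWE_LEN, len(words) - i), 1, -1):
            if tuple(words[i:i + n]) in MWE_LEXICON:
                tokens.append("_".join(words[i:i + n]))
                i += n
                break
        else:
            tokens.append(words[i])
            i += 1
    return tokens

print(tokenize("He decided to give up in spite of the risk"))
# → ['he', 'decided', 'to', 'give_up', 'in_spite_of', 'the', 'risk']
```

Greedy longest-match is only a first approximation: real systems must also handle discontinuous expressions (e.g. "give it up") and disambiguate literal from idiomatic uses.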
Similar papers
Multi-word tokenization for natural language processing
Sophisticated natural language processing (NLP) applications are entering everyday life in the form of translation services, electronic personal assistants, and open-domain question answering systems. As voice-operated applications like these become commonplace, users increasingly expect to communicate with such services in unrestricted natural language, just as in a normal...
Techniques for Arabic Morphological Detokenization and Orthographic Denormalization
The common wisdom in the field of Natural Language Processing (NLP) is that orthographic normalization and morphological tokenization help in many NLP applications for morphologically rich languages like Arabic. However, when Arabic is the target output, it should be properly detokenized and orthographically correct. We examine a set of six detokenization techniques over various tokenization sc...
Applying Natural Language Processing Techniques for Effective Persian-English Cross-Language Information Retrieval
Much attention has recently been paid to natural language processing in information storage and retrieval. This paper describes how the application of natural language processing (NLP) techniques can enhance cross-language information retrieval (CLIR). Using a semi-experimental technique, we took Farsi queries to retrieve relevant documents in English. For translating Persian queries, we used a...
MACAON An NLP Tool Suite for Processing Word Lattices
MACAON is a tool suite for standard NLP tasks developed for French. MACAON has been designed to process both human-produced text and the highly ambiguous word lattices produced by NLP tools. MACAON is made of several native modules for common tasks such as tokenization, part-of-speech tagging, and syntactic parsing, all communicating with each other through XML files. In addition, exchange proto...
The TextPro Tool Suite
We present TextPro, a suite of modular Natural Language Processing (NLP) tools for the analysis of Italian and English texts. The suite has been designed to integrate and reuse state-of-the-art NLP components developed by researchers at FBK. The current version of the tool suite provides functions ranging from tokenization to chunking and Named Entity Recognition (NER). The system's architect...